Linear Regression constrains $g(\cdot)$ to be a linear function of the inputs when predicting the labels:
$\hat{y} = g(X) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
▶ $\hat{y}$ is the predicted value for the label.
▶ $n$ is the number of features.
▶ $x_j$ is the $j$-th input feature.
▶ $\theta_k$ is the $k$-th model parameter, including the bias term $\theta_0$.
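As a minimal sketch of this prediction (assuming NumPy; the function name and the toy numbers are only illustrative), the model reduces to a dot product once a constant feature of 1 is prepended to account for the bias term:

```python
import numpy as np

def predict(theta, X):
    """Linear prediction: y_hat = theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta : array of shape (n + 1,), with theta[0] the bias term.
    X     : array of shape (m, n), one row per instance.
    """
    X_b = np.c_[np.ones(len(X)), X]   # prepend a column of 1s for the bias term
    return X_b @ theta

# Illustrative toy example: 3 instances, 2 features
theta = np.array([1.0, 2.0, -0.5])
X = np.array([[1.0, 4.0], [2.0, 0.0], [0.5, 1.0]])
print(predict(theta, X))              # [1.0, 5.0, 1.5]
```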
In Machine Learning applications, we usually opt for a generic optimization algorithm that is capable of finding optimal solutions to a wide range of problems: Gradient Descent.
Gradient descent is an algorithm that iteratively adjusts the model's parameters (weights and biases) in the direction that reduces the cost function.
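A minimal sketch of batch gradient descent for linear regression with the MSE cost, assuming NumPy; the learning rate and number of iterations shown are illustrative hyperparameter choices, not prescribed values:

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent on the MSE cost of linear regression.

    At each step the parameters move opposite to the gradient of the
    cost, scaled by the learning rate (a hyperparameter).
    """
    m = len(X)
    X_b = np.c_[np.ones(m), X]          # add a bias column of 1s
    theta = np.zeros(X_b.shape[1])      # initialize all parameters at zero
    for _ in range(n_iterations):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)  # gradient of MSE w.r.t. theta
        theta -= learning_rate * gradients               # step downhill
    return theta
```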
Different ways to avoid overfitting:
▶ Gather more training data, or clean the existing data to fix errors and remove outliers, reducing noise.
▶ Regularization. Constrain the trainable parameters to remain relatively small (see the sketch after this list).
▶ Early stopping. Hold out a validation sample and stop training as soon as the model's performance on it starts to degrade (also shown in the sketch below).
▶ Hyperparameter tuning. Determine the combination of hyperparameter values that performs best on the validation sample.
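A minimal sketch, building on the gradient descent routine above, of how an L2 (ridge-style) penalty and early stopping on a validation sample can be combined; the function name, the hyperparameter values (alpha, patience, learning rate), and the stopping rule are illustrative assumptions:

```python
import numpy as np

def fit_regularized_early_stopping(X_train, y_train, X_val, y_val,
                                   learning_rate=0.05, alpha=0.1,
                                   max_iterations=5000, patience=20):
    """Gradient descent with an L2 penalty and early stopping.

    alpha    : regularization strength, pushes parameters toward small values
               (the bias term is not penalized).
    patience : iterations without validation improvement before stopping.
    Both are hyperparameters, fixed before training starts.
    """
    Xb_train = np.c_[np.ones(len(X_train)), X_train]
    Xb_val = np.c_[np.ones(len(X_val)), X_val]
    theta = np.zeros(Xb_train.shape[1])
    best_theta, best_val_error, stalled = theta.copy(), np.inf, 0
    m = len(Xb_train)

    for _ in range(max_iterations):
        penalty = 2 * alpha * np.r_[0.0, theta[1:]]        # exclude the bias term
        gradients = (2 / m) * Xb_train.T @ (Xb_train @ theta - y_train) + penalty
        theta -= learning_rate * gradients

        val_error = np.mean((Xb_val @ theta - y_val) ** 2)  # MSE on the validation sample
        if val_error < best_val_error:
            best_val_error, best_theta, stalled = val_error, theta.copy(), 0
        else:
            stalled += 1
            if stalled >= patience:                         # validation error stopped improving
                break
    return best_theta
```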
Hyperparameter: a parameter of the learning algorithm itself that does not directly serve to predict the labels. As such, hyperparameters are set before training and remain fixed during it.